Add benchmark metrics persistence #2339
Conversation
Reviewer's Guide

Adds persistent benchmark metrics collection, storage, and querying to RamaLama, including a new `benchmarks` CLI command.
Summary of Changes

Hello @ieaves, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the benchmarking capabilities of RamaLama.
Hey - I've found 1 security issue, 5 other issues, and left some high-level feedback:
Security issues:
- Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
General comments:
- In `benchmarks_list_cli`, when `args.format == 'json'` you pass a generator (`(asdict(item) for item in results)`) to `json.dumps`, which will fail serialization; convert to a list (e.g., `[asdict(item) for item in results]`) before dumping.
- In `normalize_benchmark_record`, the error message uses `type(BenchmarkRecord)` instead of the actual instance type; update it to `type(benchmark)` so the raised `NotImplementedError` reports the correct offending type.
- The `ramalama-benchmarks` man page describes results being stored in a SQLite database with a `db_path`, but `BenchmarksManager` currently writes JSONL to `benchmarks.jsonl`; consider aligning the implementation with the documented SQLite behavior or adjusting the configuration naming to avoid confusion.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `benchmarks_list_cli`, when `args.format == 'json'` you pass a generator (`(asdict(item) for item in results)`) to `json.dumps`, which will fail serialization; convert to a list (e.g., `[asdict(item) for item in results]`) before dumping.
- In `normalize_benchmark_record`, the error message uses `type(BenchmarkRecord)` instead of the actual instance type; update it to `type(benchmark)` so the raised `NotImplementedError` reports the correct offending type.
- The `ramalama-benchmarks` man page describes results being stored in a SQLite database with a `db_path`, but `BenchmarksManager` currently writes JSONL to `benchmarks.jsonl`; consider aligning the implementation with the documented SQLite behavior or adjusting the configuration naming to avoid confusion.
## Individual Comments
### Comment 1
<location> `ramalama/cli.py:572-574` </location>
<code_context>
+ print("No benchmark results found")
+ return
+
+ if args.format == "json":
+ output = (asdict(item) for item in results)
+ print(json.dumps(output, indent=2, sort_keys=True))
+ else:
+ print_bench_results(results)
</code_context>
<issue_to_address>
**issue (bug_risk):** JSON output path builds a generator, which `json.dumps` cannot serialize.
`output` is a generator (`(asdict(item) for item in results)`), which `json.dumps` cannot serialize and will raise a `TypeError`. Please convert to a concrete structure first, e.g. `output = [asdict(item) for item in results]`, or switch to a streaming JSON approach.
</issue_to_address>
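To make the failure mode concrete, here is a self-contained sketch of the fix; the `Row` dataclass is a hypothetical stand-in for the real benchmark record type:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Row:  # hypothetical stand-in for the real record type
    model: str
    tokens_per_second: float


results = [Row("tiny", 42.0)]

# json.dumps raises TypeError on a generator; materialize to a list first
output = [asdict(item) for item in results]
print(json.dumps(output, indent=2, sort_keys=True))
```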
### Comment 2
<location> `ramalama/benchmarks/schemas.py:114` </location>
<code_context>
+ configuration: TestConfigurationV1
+ result: LlamaBenchResultV1
+ version: Literal["v1"] = "v1"
+ created_at: str = datetime.now(timezone.utc).isoformat()
+ device: DeviceInfoV1 = field(default_factory=DeviceInfoV1.current_device_info)
+
</code_context>
<issue_to_address>
**issue (bug_risk):** `created_at` default is evaluated at import time, so all records share the same timestamp.
Because this default is evaluated at class definition time, every `BenchmarkRecordV1` created without an explicit `created_at` will have the same timestamp. Use a `default_factory`, e.g. `field(default_factory=lambda: datetime.now(timezone.utc).isoformat())`, to generate a fresh value per instance.
</issue_to_address>
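A minimal illustration of the difference, using throwaway record classes rather than the PR's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RecordShared:
    # BUG: the default is evaluated once, at class-definition (import) time
    created_at: str = datetime.now(timezone.utc).isoformat()


@dataclass
class RecordFresh:
    # default_factory runs on each instantiation, producing a fresh timestamp
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


a, b = RecordShared(), RecordShared()
assert a.created_at == b.created_at  # always identical, no matter when created
```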
### Comment 3
<location> `ramalama/benchmarks/schemas.py:202-206` </location>
<code_context>
+ raise NotImplementedError(f"No supported benchmark schemas for version {version}")
+
+
+def normalize_benchmark_record(benchmark: BenchmarkRecord) -> BenchmarkRecordV1:
+ if isinstance(benchmark, BenchmarkRecordV1):
+ return benchmark
+
+ raise NotImplementedError(f"Received an unsupported benchmark record type {type(BenchmarkRecord)}")
</code_context>
<issue_to_address>
**issue (bug_risk):** Error message uses `type(BenchmarkRecord)` instead of the actual `benchmark` instance.
This will always describe the `BenchmarkRecord` class object itself rather than the runtime type of the value that was passed in, which makes the error misleading. Using `type(benchmark)` would correctly show the unexpected concrete type received.
</issue_to_address>
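For illustration, a runnable sketch with stub classes standing in for the real schemas:

```python
class BenchmarkRecord: ...


class BenchmarkRecordV1(BenchmarkRecord): ...


def normalize_benchmark_record(benchmark: BenchmarkRecord) -> BenchmarkRecordV1:
    if isinstance(benchmark, BenchmarkRecordV1):
        return benchmark
    # type(benchmark) names the concrete class that was actually passed in
    raise NotImplementedError(f"Received an unsupported benchmark record type {type(benchmark)}")


print(type(BenchmarkRecord))      # <class 'type'> -- the misleading output from the bug
print(type(BenchmarkRecordV1()))  # <class '__main__.BenchmarkRecordV1'>
```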
### Comment 4
<location> `ramalama/benchmarks/schemas.py:40-47` </location>
<code_context>
+ container_runtime: str = ""
+ inference_engine: str = ""
+ version: Literal["v1"] = "v1"
+ runtime_args: dict[str, Any] | None = None
+
+
</code_context>
<issue_to_address>
**suggestion:** The declared type of `runtime_args` does not match how it is populated.
In `BaseTransport.bench`, `runtime_args` receives `cmd` (a `list[str]`), but it’s annotated as `dict[str, Any] | None`. Please align the annotation with actual usage (e.g., `list[str] | None` or `object`) or change the caller to pass a mapping instead of a list.
```suggestion
@dataclass
class TestConfigurationV1(TestConfiguration):
    """Container configuration metadata for a benchmark run."""

    container_image: str = ""
    container_runtime: str = ""
    inference_engine: str = ""
    version: Literal["v1"] = "v1"
    runtime_args: list[str] | None = None
```
</issue_to_address>
### Comment 5
<location> `ramalama/config.py:153-154` </location>
<code_context>
+ version: ClassVar[Any]
+
+
+@dataclass
+class DeviceInfoV1(DeviceInfo):
+ hostname: str
</code_context>
<issue_to_address>
**issue (bug_risk):** Config field name `storage_folder` conflicts with documented `db_path` key.
Unless there’s explicit aliasing between `db_path` and `storage_folder`, values set in the config file will be ignored and the default will always be used. Please either rename the field to match the documented key or introduce a compatibility mapping so existing configs continue to work as expected.
</issue_to_address>
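One possible shape for the compatibility mapping; the helper name and section layout here are hypothetical, not the PR's actual config code:

```python
# hypothetical shim: accept the documented `db_path` key as an alias
# for the internal `storage_folder` field when reading a config section
def resolve_storage_folder(section: dict, default: str) -> str:
    if "storage_folder" in section:
        return section["storage_folder"]
    if "db_path" in section:  # documented (legacy) key name
        return section["db_path"]
    return default


print(resolve_storage_folder({"db_path": "/var/lib/ramalama/benchmarks"}, "/tmp/benchmarks"))
```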
### Comment 6
<location> `ramalama/transports/base.py:472` </location>
<code_context>
result = subprocess.CompletedProcess(args=escaped_cmd, returncode=0, stdout="", stderr="")
</code_context>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
*Source: opengrep*
</issue_to_address>
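Note that the hint's `shlex.escape()` does not exist in the standard library; the actual escaping helper is `shlex.quote()`. A small illustration (the command and path are made up):

```python
import shlex

cmd = ["llama-bench", "-m", "/models/tiny.gguf; rm -rf /"]
# shlex.quote wraps each argument so shell metacharacters are inert
escaped = " ".join(shlex.quote(arg) for arg in cmd)
print(escaped)  # llama-bench -m '/models/tiny.gguf; rm -rf /'
```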
Code Review
This pull request introduces a significant new feature for persisting and viewing benchmark results. It adds a new benchmarks CLI command, enhances the bench command with structured JSON output, and includes configuration and documentation for these changes. The implementation is comprehensive, covering data schemas, storage management, CLI integration, and testing.
My review focuses on ensuring correctness, maintainability, and consistency between the code and its documentation. I've identified a few issues, including documentation inaccuracies regarding the storage mechanism (JSONL vs. SQLite), a potential bug with a mutable default in a dataclass, and an issue with JSON serialization of a generator. Addressing these points will improve the robustness and usability of this new feature.
@olliewalsh something went wonky with #2237 and I had to open a new PR. Sorry! This contains the suggested documentation changes and the refactor from SQLite to JSONL.
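For reference, the JSONL storage pattern mentioned above is only a few lines; these helper names are hypothetical, not the PR's actual `BenchmarksManager` API:

```python
import json
from pathlib import Path


def append_record(path: Path, record: dict) -> None:
    # one JSON object per line; each bench run appends a record
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def read_records(path: Path):
    # stream records back out, skipping blank lines
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


store = Path("benchmarks.jsonl")
append_record(store, {"version": "v1", "result": {"tokens_per_second": 42.0}})
print(list(read_records(store)))
```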
ramalama/config.py (Outdated)

```python
def get_default_benchmarks_storage_folder() -> Path:
    conf_dir = None
    for dir in DEFAULT_CONFIG_DIRS:
```
This should use the store dir. The config dir may not be writable.
👍. Ideally we'd be able to default the benchmarks storage directory to the user-specified store path in this case. Unfortunately, is_set doesn't currently support checking nested subfields, so I put it off and instead had it use the default store path as the default benchmarks base path.
The config is getting pretty complicated at this point. Do you think it'd be worth bringing something like pydantic in and managing this through the likes of BaseSettings?
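For what it's worth, a minimal sketch of that BaseSettings idea, assuming pydantic-settings v2 were brought in; the field name mirrors the RAMALAMA__BENCHMARKS_STORAGE_FOLDER variable mentioned below:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class RamalamaSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="RAMALAMA__")

    # populated from RAMALAMA__BENCHMARKS_STORAGE_FOLDER when set
    benchmarks_storage_folder: str | None = None


settings = RamalamaSettings()
print(settings.benchmarks_storage_folder)
```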
@olliewalsh This should be good to go if you're ready to merge.
ramalama/cli.py (Outdated)

```python
except MissingStorageFolderError:
    print("Error: RAMALAMA__BENCHMARKS_STORAGE_FOLDER not configured")
    sys.exit(1)
```
Better to raise an exception and let main() handle the exit code
Might make sense to pull that code out of the try/except altogether in that case.
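A sketch of that flow, with simplified stand-ins for the real handler and main():

```python
import sys


class MissingStorageFolderError(Exception):
    pass


def benchmarks_list_cli(args) -> None:
    # the handler raises instead of calling sys.exit itself
    raise MissingStorageFolderError("RAMALAMA__BENCHMARKS_STORAGE_FOLDER not configured")


def main() -> int:
    try:
        benchmarks_list_cli(None)
    except MissingStorageFolderError as e:
        print(f"Error: {e}", file=sys.stderr)
        return 1  # main() owns the process exit code
    return 0


if __name__ == "__main__":
    sys.exit(main())
```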
ramalama/transports/base.py (Outdated)

```python
else:
    dry_run(cmd)

result = subprocess.CompletedProcess(args=cmd, returncode=0, stdout="", stderr="")
```
Is this necessary? Everything else just returns None
It's not necessary; it just keeps the typing consistent without the early return. I'll switch to an early return instead though, since it's cleaner.
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
…ords Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
Summary by Sourcery
Introduce persistent benchmark result storage and a new CLI for viewing historical benchmarks, while enriching bench output formatting and configuration support.
New Features:
- New `benchmarks` CLI command with subcommands (currently `list`) to view stored benchmark results in table or JSON formats with pagination.
- Persist `bench` command outputs as structured benchmark records, capturing device, configuration, and result metadata for later inspection.

Enhancements:
- Extend the `bench` command to support selectable output formats (table or JSON) and render richer tabular benchmark summaries including model parameters and throughput.

Documentation:
- Add a new `ramalama-benchmarks(1)` man page, while updating existing man pages and examples to reference the new commands and options.

Tests: